Skip to content

1 delete orphan files #35

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

1 delete orphan files #35

wants to merge 9 commits into from

Conversation

subkanthi
Copy link
Collaborator

@subkanthi subkanthi commented Jun 23, 2025

closes:#1

@subkanthi subkanthi linked an issue Jun 23, 2025 that may be closed by this pull request

// Remove orphans.
OrphanFileScanner orphanFileScanner = new OrphanFileScanner(table);
orphanFileScanner.removeOrphanedFiles(olderThanMillis, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want a shorter ttl here

import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;

public class OrphanFileScanner {
private static final Logger LOG = LoggerFactory.getLogger(OrphanFileScanner.class);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called logger in the rest of the codebase

String bucket = location.replace("s3://", "").split("/")[0];
String prefix = location.replace("s3://" + bucket + "/", "");

S3Client s3 = S3.newClient(true);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

com.altinity.ice.cli.internal.* shouldn't be used here; true indicates "anonymous" access -> won't work;
can we use table io here (so that impl would be not tied to s3)?

deletedFiles.forEach(f -> LOG.info("Deleted: {}", f));
}

executor.shutdownNow();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already called shutdown on line 120

return allFiles;
}

public void removeOrphanedFiles(long olderThanMillis, boolean dryRun) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on current impl this method should be executed only when warehouse is s3 (not s3 tables, etc)


// Set<String> orphanedFiles = new HashSet<>();

allFiles.remove(knownFiles);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this supposed to be removeAll?

}
}

return knownFiles;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result looks incomplete. Inserting a single file into empty table yields

./metadata/00000-f62c927f-3965-42e9-8f3a-57809ca49f55.metadata.json
./metadata/4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1-m0.avro
./metadata/snap-7186194791324723249-1-4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1.avro
./metadata/00001-07fe7fa7-8fe1-4666-bd7b-7c88b9c861fe.metadata.json
./data/pickup_datetime_month=2009-01/1750715797525-2d861d7a4e5196bf04e5c3be506b0152f4170cf4e941424c8245b46903c3e252-part.parquet

i.e. we're going to nuke a bunch of critical files and break the catalog...

knownFiles.add(dataFile.path().toString());
}
} catch (Exception e) {
logger.error("Error getting list of data files", e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

network error here leads to us deleting files from catalog that are not supposed to be deleted

@subkanthi
Copy link
Collaborator Author

image

@subkanthi
Copy link
Collaborator Author

Tested by dropping a new file

2025-07-03 19:17:03 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Looking for Orphaned files in location s3://bucket1/flowers/iris
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Found 1 orphaned files at s3://bucket1/flowers/iris!
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Deleted 1 orphaned files at s3://bucket1/flowers/iris!
2025-07-03 19:21:37 [-1-thread-1] INFO c.a.i.r.c.i.m.OrphanFileScanner > Deleted: s3://bucket1/flowers/iris/metadata/restore.sql

@subkanthi subkanthi marked this pull request as ready for review July 4, 2025 00:22
@subkanthi subkanthi requested a review from shyiko July 4, 2025 00:22
return;
}

if (dryRun) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dryRun should probably a config variable, so they can try before toggling it?


if (orphanFileExpirationDays == 0) {
logger.info(
"Skipping orphan file removal for table {} since orphanFileExpirationDays config was not set",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be avoided for every table if we have a separate iteration, not sure if this is the best option.

@shyiko
Copy link
Collaborator

shyiko commented Jul 9, 2025

we should probably default to

Unreferenced file removal identifies and deletes all objects that are not referenced by any table snapshots. As part of your unreferenced file removal policy, you can configure two properties: unreferencedDays (3 days by default) and nonCurrentDays (10 days by default).

to match https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-table-buckets-maintenance.html

@subkanthi
Copy link
Collaborator Author

subkanthi commented Jul 9, 2025

we should probably default to

Unreferenced file removal identifies and deletes all objects that are not referenced by any table snapshots. As part of your unreferenced file removal policy, you can configure two properties: unreferencedDays (3 days by default) and nonCurrentDays (10 days by default).

to match https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-table-buckets-maintenance.html

Setting default to 10 days for orphanFileExpirationDayssimilar to unreferencedDays
This functionality nonCurrentDays is not implemented, the files are removed after orphanFileExpirationDays

Comment on lines +52 to +65
FileIO io = table.io();

TableOperations ops = ((BaseTable) table).operations();
TableMetadata meta = ops.current();

String currentMetadataFile = meta.metadataFileLocation();
// Current metadata json file
knownFiles.add(currentMetadataFile);

// All the previous metadata JSON(is there a chance there might be
// json files that are not physically present?.
for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) {
knownFiles.add(previousFile.file());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same files are added on each loop iteration

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which ones, I dont see the same files
image

Copy link
Collaborator

@shyiko shyiko Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have a single snapshot; add another one to repro. but take a look at the selected code - it's fairly obvious the code is not connected to the outer for loop

for (DataFile dataFile : files) {
knownFiles.add(dataFile.path().toString());
}
} catch (Exception e) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: catch/throw

// All the previous metadata JSON(is there a chance there might be
// json files that are not physically present?.
for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) {
knownFiles.add(previousFile.file());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result is missing statistics & partition statistics files


// All the previous metadata JSON(is there a chance there might be
// json files that are not physically present?.
for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files referenced by previous manifests are also missing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ice-rest-catalog: Automate catalog maintenance
2 participants